Charles Peterson
In this workshop, we will use Big Data techniques with High-Performance Computing resources
We will go over:
General concepts of Big Data
Simple examples on Hoffman2
Any suggestions for upcoming workshops, email me at cpeterson@oarc.ucla.edu
This presentation can be found on our UCLA OARC’s github repo
https://github.com/ucla-oarc-hpc/WS_BigDataOnHPC
View slides:
Note
This presentation was built with Quarto and RStudio.
BigDataHPC.qmd

The term Big Data refers to situations where data sets, data processing, data modeling, and other data science tasks become too large and complex.
Traditional techniques are not enough to handle these projects.
image source - Stable Diffusion https://github.com/CompVis/stable-diffusion
If you use data, then YES!
Big Data can solve problems for all types of research areas.
Big Data is used to scale up research.
Problems that arise as data projects grow in size and complexity
High-Performance Computing resources can help solve Big Data challenges by providing more computing power than typical workstations.
Big Data can be described by
Understanding your data can help you choose which Big Data techniques to run.
Scaling Data Size
Scaling Model/Task Size
image source - DASK https://ml.dask.org/index.html
Various frameworks, APIs, and libraries for Big Data projects
RDD - Resilient Distributed Dataset
Large Datasets can be spread over multiple compute nodes
Spark supports different levels of persistence
Along with RDDs, Spark also has a DataFrame API (Pandas-like)
This is a SQL-like library that can collect data with named columns
Great for structured/semi-structured data
The SparkSession is the entry point for Spark
spark = SparkSession.builder \
.appName("MyPySpark") \
.config("spark.driver.memory", "15g") \
    .getOrCreate()

The SparkContext is the entry point for RDDs
The SparkContext is available through the spark object from SparkSession.builder

The easiest way to install PySpark is with anaconda3.
This is great when running PySpark on a single compute node.
module load anaconda3/2022.05
conda create -n mypyspark openjdk python=3.9 \
pyspark=3.3.0 py4j jupyterlab findspark \
h5py pytables pandas \
-c conda-forge -y
conda activate mypyspark
pip install ipykernel
ipython kernel install --user --name=mypyspark

This will create a conda env named mypyspark with access to Jupyter
This conda env will have both Spark and PySpark installed
Note
Information on using Anaconda can be found from a previous workshop
Let's go over basic PySpark functions
spark-ex1 from github repo

Make sure you download this workshop from github into your Hoffman2 scratch directory
cd $SCRATCH
git clone https://github.com/ucla-oarc-hpc/WS_BigDataOnHPC
cd WS_BigDataOnHPC
cd spark-ex1

Spark_basics.ipynb

In this example, we will use data from Project Gutenberg
We will download “The Hound of the Baskervilles” by Arthur Conan Doyle
We will use the h2jupynb script to start Jupyter on Hoffman2
You will run this on your LOCAL computer.
wget https://raw.githubusercontent.com/rdauria/jupyter-notebook/main/h2jupynb
chmod +x h2jupynb
#Replace 'joebruin' with your user name for Hoffman2
#You may need to enter your Hoffman2 password twice
python3 ./h2jupynb -u joebruin -t 5 -m 10 -e 2 -s 36 -a intel-gold\\* \
        -x yes -d /SCRATCH/PATH/WS_BigDataOnHPC

Note
The -d option in the python3 ./h2jupynb command will need the full PATH of the $SCRATCH/WS_BigDataOnHPC directory
This will start a Jupyter session on Hoffman2 with ONE entire intel-gold compute node (36 cores)
More information on the h2jupynb can be found on the Hoffman2 website
This example will use Spark’s Machine Learning library (MLlib)
We will use data from the Million song subset
This subset has ~500,000 songs with:
Download the dataset
We will use multiple nodes to run Spark
In the previous Basic operations example, we used pyspark with 1 (36-core) compute node.
We will run PySpark over multiple nodes.
This will:
To do this, we will NOT use the Spark installation from our conda install, but use Spark from a build that we will download from the Spark website.
mkdir -pv $SCRATCH/WS_BigDataOnHPC/apps/spark
cd $SCRATCH/WS_BigDataOnHPC/apps/spark
wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
tar -vxf spark-3.3.0-bin-hadoop3.tgz

Note
Though we will not use the Spark from conda, we will still use the PySpark package that was installed with conda. The Spark and PySpark packages will need to be the same version (3.3.0 in this example)
Since we are using the Spark build that we just downloaded, we will start Spark and submit it as a job, then connect to Jupyter.
spark-ex2
pyspark-multi-jupyter.job

$SPARK_HOME/bin/pyspark

In this example, we will use 3 compute nodes in total.
Tip
For large data jobs, I like to have the Spark driver be separate from the workers.
Large data jobs may place a heavy CPU and memory load on the Spark driver.
# Replace NODENAME with the name of the node
# Replace joebruin with your Hoffman2 user name
ssh -L 8888:NODENAME:8888 joebruin@hoffman2.idre.ucla.edu

This will create an SSH tunnel to the compute node so we can open Jupyter at http://localhost:8888
Then we can open the notebook named spark-ML.ipynb
Spark has a visual dashboard that shows tasks in real time
By default, Spark will run this dashboard on port 4040
Create an SSH tunnel to the compute node to view the dashboard on your local machine
You will need to replace NODENAME with the compute node that has your Spark job
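The tunnel command is a minor variation of the Jupyter one above, pointed at the dashboard's default port (NODENAME is a placeholder, and joebruin stands in for your Hoffman2 user name):

```shell
# Replace NODENAME with the compute node running your Spark job
# This forwards the dashboard's default port 4040 to http://localhost:4040
ssh -N -L 4040:NODENAME:4040 joebruin@hoffman2.idre.ucla.edu
```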
You can run Spark as a batch job to run non-interactively.
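A sketch of such a batch invocation (the script name my_analysis.py and the resource flags are made-up examples):

```shell
# Run a PySpark script non-interactively with spark-submit
# 'my_analysis.py' is a hypothetical script name
$SPARK_HOME/bin/spark-submit \
    --master local[4] \
    --driver-memory 8g \
    my_analysis.py
```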
Instead of the interactive pyspark shell, use spark-submit, located at:
$SPARK_HOME/bin/spark-submit

I do have another Machine Learning example for Spark that I may not have time to go over in this workshop
In this example, we will train a Machine Learning model using data from LIBSVM
spark-bonus
ML-bonus.ipynb

Dask is a parallel computing library for Python
Dask uses multiple cores to run tasks
Can use GPUs to speed up tasks
Dask has an Arrays API created from NumPy-like chunks
Dask Arrays can process chunks over multiple cores.
Great for larger-than-memory arrays, since only the chunks being worked on are held in memory.
Dask Arrays have most of the same methods and functions as NumPy objects
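A small sketch of a chunked array (the sizes here are chosen arbitrarily):

```python
import dask.array as da

# A 4000x4000 array built from 1000x1000 NumPy chunks;
# only one chunk at a time needs to fit in memory
x = da.random.random((4000, 4000), chunks=(1000, 1000))

# NumPy-style operations are lazy until .compute()
total = x.sum().compute()
print(total)  # roughly 8e6 (mean 0.5 over 16 million elements)
```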
Image source - https://docs.dask.org/en/stable/
Dask DataFrames are Pandas-like objects
conda create -n mydask python pandas jupyterlab joblib \
      dask dask-ml nodejs graphviz python-graphviz \
-c conda-forge -y
conda activate mydask
pip install ipykernel
ipython kernel install --user --name=mydask

This will create a conda env, mydask, with Dask, Dask-ML, and Jupyter installed
We will use the h2jupynb script to start Jupyter on Hoffman2
You will run this on your LOCAL computer.
python3 ./h2jupynb -u joebruin -t 5 -m 10 -e 2 -s 36 -a intel-gold\\* -x yes \
        -d /SCRATCH/PATH/WS_BigDataOnHPC

Replace joebruin with your Hoffman2 user account.
Replace /SCRATCH/PATH/WS_BigDataOnHPC with the full PATH name of the workshop on Hoffman2
dask-ex1
dask_basic.ipynb

Dask has a Dask-ML library with scalable Machine Learning methods
There is also integration with:
This example will use Scikit-Learn with Dask
We will use data from the Million song subset
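One common pattern (a sketch, assuming dask.distributed is installed in the env) is to send scikit-learn's joblib-based parallel work to a Dask cluster; the data here is synthetic, not the song subset:

```python
import joblib
from dask.distributed import Client, LocalCluster
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Start a small local Dask cluster (on Hoffman2 this could span a node's cores)
cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)

# Synthetic regression data standing in for the song features
X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# joblib's "dask" backend routes scikit-learn's parallel work to the cluster
model = RandomForestRegressor(n_estimators=20, n_jobs=-1, random_state=0)
with joblib.parallel_backend("dask"):
    model.fit(X, y)

score = model.score(X, y)
print(score)
client.close()
cluster.close()
```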
cd $SCRATCH/WS_BigDataOnHPC
cd dask-ex2
wget https://archive.ics.uci.edu/ml/machine-learning-databases/00203/YearPredictionMSD.txt.zip
unzip YearPredictionMSD.txt.zip

Replace joebruin with your Hoffman2 user account
Replace /SCRATCH/PATH/WS_BigDataOnHPC with the full PATH name of the workshop on Hoffman2

python3 ./h2jupynb -u joebruin -t 5 -m 10 -e 2 -s 36 -a intel-gold\\* -x yes -d /SCRATCH/PATH/WS_BigDataOnHPC

MSD-dask.ipynb

Dask has a visual dashboard that can view the tasks in real-time
You will need to replace NODENAME with the compute node that has your Dask job
This is a taste of what Big Data can do.
Spark and Dask are just two popular frameworks for Big Data